Using Linguistic Information to Improve the Performance of Vector-Based Semantic Analysis

Authors

  • Magnus Sahlgren
  • David Swanberg
Abstract

In this paper, we will show that the performance of vector-based semantic analysis can be improved by considering basic linguistic structures in the data, e.g. morphology. For this purpose, we have used a new method for vector-based semantic analysis that computes semantic word vectors based on distributed representations by means of random labeling of words in narrow context windows. This form of representation is more natural than previously reported techniques and, as we will show, equivalent or even superior in performance when subjected to a standardized synonym test.

The use of vector-based models of information for the purpose of semantic analysis is an area of research that has gained substantial recognition over the last decade. Techniques such as LSA and HAL have demonstrated the viability of computing semantic word vectors from the co-occurrence statistics of words in large text data. However, the prevailing techniques have been almost exclusively statistical, and have consequently paid little or no attention to the linguistic structures of the data used in the experiments. This negligence regarding linguistics has, of course, been at least partly deliberate, as one of the primary goals of the techniques has been to develop representations of word meanings from text data "that was minimally preprocessed, not unlike human-concept acquisition" (Burgess & Lund, 1998).

LSA and HAL are both purely statistical methods that treat the text data simply as a bag of words in which the only relevant piece of structural information is the words-by-contexts co-occurrence frequencies. What separates the two approaches is their treatment and conception of context. In LSA, the text data is represented as a words-by-documents co-occurrence matrix where each cell indicates the frequency of a given word in a given text sample of approximately 150 words. The frequencies are normalized, and the normalized matrix is transformed with Singular Value Decomposition (SVD) into a smaller matrix with reduced dimensionality. The purpose of using SVD to reduce the dimensions of the normalized frequency matrix is that this operation appears to accomplish inductive effects that capture latent semantic structures in the text data. Words are thus represented in the reduced matrix by semantic vectors of dimensionality n (300 proving to be optimal in Landauer & Dumais' (1997) experiments).

In HAL, the data is represented as a words-by-words co-occurrence matrix where each cell indicates the co-occurrence counts for a single word pair (a word pair being an asymmetrical relation so that …
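The LSA pipeline described above (words-by-documents counts, normalization, SVD to a reduced dimensionality) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `lsa_word_vectors` and `cosine`, the toy documents, and the use of `log(1 + count)` as the normalization step are all assumptions made for illustration, since the text does not specify the weighting scheme.

```python
import numpy as np


def lsa_word_vectors(docs, k):
    """Return (vocab, vectors) with one k-dimensional row vector per word.

    Hypothetical helper: builds a words-by-documents count matrix,
    normalizes it, and reduces it with a truncated SVD.
    """
    vocab = sorted({w for d in docs for w in d.split()})
    index = {w: i for i, w in enumerate(vocab)}

    # Words-by-documents co-occurrence matrix: cell (i, j) is the
    # frequency of word i in text sample j.
    counts = np.zeros((len(vocab), len(docs)))
    for j, d in enumerate(docs):
        for w in d.split():
            counts[index[w], j] += 1

    # Assumed normalization: dampen raw frequencies with log(1 + count).
    weighted = np.log1p(counts)

    # SVD, keeping only the k largest singular values.
    u, s, _ = np.linalg.svd(weighted, full_matrices=False)
    k = min(k, len(s))
    return vocab, u[:, :k] * s[:k]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0


# Toy usage: three tiny "documents" stand in for ~150-word text samples.
docs = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "stocks fell on the market",
]
vocab, vectors = lsa_word_vectors(docs, k=2)
similarity = cosine(vectors[vocab.index("cat")], vectors[vocab.index("dog")])
```

In a realistic setting the matrix would be large and sparse, so a sparse truncated SVD routine would be a more practical choice than the dense decomposition used here.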


Similar resources

An Intelligence-Based Model for Supplier Selection Integrating Data Envelopment Analysis and Support Vector Machine

The importance of supplier selection is nowadays highlighted more than ever as companies have realized that efficient supplier selection can significantly improve the performance of their supply chain. In this paper, an integrated model that applies Data Envelopment Analysis (DEA) and Support Vector Machine (SVM) is developed to select efficient suppliers based on their predicted efficiency sco...


IMPROVE THE RECOMMENDER SYSTEM USING SEMANTIC WEB

When buying necessities such as books, movies, CDs, and music, people tend to rely on others' spoken and written advice and recommendations and factor them into their decisions. Nowadays, with the progress of technology and the growth of e-business on websites, a new age of digital life has begun with recommender systems. The most important objectives of these systems include a...


The Analysis of Semantic Field in Persian-Speaking Patients With Wernicke’s Aphasia

Objectives: Wernicke’s aphasia is one of the most prominent focal brain deficits affecting the comprehension abilities of patients while preserving their production abilities. Although a lot of studies in different languages have been conducted to analyze the nature of this deficit, still some controversies exist in this regard. While some research studies attribute this defect to a performance...


Analysis of User query refinement behavior based on semantic features: user log analysis of Ganj database (IranDoc)

Background and Aim: Information systems cannot be well designed or developed without a clear understanding of users' needs and of how they seek and evaluate information. This research was designed to analyze the query refinement behaviors of users of Ganj (Iranian research institute of science and technology database) via log analysis. Methods: The method of this research is log anal...


Covariance Analysis of a vector tracking GPS receiver based on MMSE multiuser Detection

In highly dynamic conditions, using vector tracking loops instead of scalar tracking loops in GPS receivers has proved to be an efficient method of compensating for degraded performance. The Minimum Mean Squared Error detector, a multiuser detector, is applied in the vector tracking loop for greater reliability and efficiency. The Kalman filter does the two tasks of tracking and extracting the navigation data aft...


A Joint Semantic Vector Representation Model for Text Clustering and Classification

Text clustering and classification are two main tasks of text mining. Feature selection plays a key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcomings in capturing the semantic concepts of text have motivated researchers to use...



Publication date: 2001